[GPU] Optimize merge memory usage #136411
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
libs/simdvec/src/main/java/org/elasticsearch/simdvec/QuantizedByteVectorValuesAccess.java
distribution/tools/server-cli/src/main/java/org/elasticsearch/server/cli/SystemJvmOptions.java
...rc/main/java/org/elasticsearch/index/codec/vectors/reflect/VectorsFormatReflectionUtils.java
@ldematte Great work. I have not tested it yet, but it's amazing how you organized it. My main comment: do you think we can simplify this PR by breaking it into two separate ones, making this PR only about changes to merges, and doing the changes for flush, ResourcesHolder, and 128Mb in a separate PR? Or are these changes tightly coupled?
...rc/main/java/org/elasticsearch/index/codec/vectors/reflect/VectorsFormatReflectionUtils.java
...rc/main/java/org/elasticsearch/index/codec/vectors/reflect/VectorsFormatReflectionUtils.java
x-pack/plugin/gpu/src/main/java/org/elasticsearch/xpack/gpu/codec/ES92GpuHnswVectorsWriter.java
I can do that: here is the PR #136464
x-pack/plugin/gpu/src/main/java/org/elasticsearch/xpack/gpu/codec/ES92GpuHnswVectorsWriter.java
@ldematte Great changes. I have done some benchmarking on my laptop with int8, and I see great recall but, surprisingly, no speedups compared with the main branch. Datasets used:

gist: 1_000_000 docs; 960 dims; euclidean metric
cohere-wikipedia_v2: 934_024 docs; 768 dims; cosine metric
x-pack/plugin/gpu/src/main/java/org/elasticsearch/xpack/gpu/codec/ES92GpuHnswVectorsWriter.java
Great work, @ldematte
@mayya-sharipova I also expected speed-ups on force merge; it seems to be a bit better, but the improvement is a few percent ("%"), not a multiple ("x").
@mayya-sharipova I updated merge as agreed, to avoid using device memory directly, due to the cuVS bug.
@ldematte Thanks, the latest changes to copy to a separate memory segment LGTM.
I have benchmarked merge performance for this, with both KnnIndexTester and ES. TL;DR: merge performance improves by 18-20%, but that gets "lost" in high-level benchmarks (they give almost identical results, within the variance). BUT with this change, ES uses no additional disk space (which in the case of 1M vectors can be 5GB!), and in the case of float32 the memory footprint of the process (working set) is reduced too (int8 will also be fixed once cuVS is fixed).
This PR changes how we gather and compact vector data for transmitting it to the GPU. Instead of using a temporary file to write out the compacted arrays, we use the vector values from the scorer supplier directly, which are backed by a memory-mapped input. This way we avoid an additional copy of the data.
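The idea of reading vectors straight from a memory-mapped file instead of first staging them in a temporary copy can be sketched as below. This is a minimal, self-contained illustration, not the actual Elasticsearch/cuVS code: the file layout, class name, and vector contents are all made up for the example; the real implementation works through Lucene's scorer supplier and off-heap vector values.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: access float vectors as views over a memory-mapped
// file, avoiding the extra copy a temporary compacted file would require.
public class MappedVectorAccess {
    public static void main(String[] args) throws IOException {
        int dims = 4;
        int numVectors = 3;
        Path file = Files.createTempFile("vectors", ".bin");

        // Write a few example vectors (stand-in for an existing segment file).
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.allocate(numVectors * dims * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
            for (int v = 0; v < numVectors; v++) {
                for (int d = 0; d < dims; d++) {
                    buf.putFloat(v * 10 + d);
                }
            }
            buf.flip();
            ch.write(buf);
        }

        // Map the file once; each vector read is a view over the mapping,
        // so no intermediate temp-file copy of the data is created.
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                .order(ByteOrder.LITTLE_ENDIAN);
            float[] vec = new float[dims];
            int ord = 1; // read the second vector (values 10..13)
            mapped.position(ord * dims * Float.BYTES);
            mapped.asFloatBuffer().get(vec, 0, dims);
            System.out.println(vec[0] + "," + vec[3]);
        }
        Files.delete(file);
    }
}
```

With this pattern, the consumer (here the GPU upload path) can be handed contiguous slices of the mapping directly, which is where the disk-space and working-set savings described above come from.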
💚 Backport successful